A Journey into the Retail World!

Introduction

Brief Description

Brief Description


We have been provided with Kroger data for academic purose to analyse the datasets and come up with meaningful insights by applying all the data wrnagling and machine learning techniques that we have learnt during our Business Analytics program. The datasets that will be used are not complete datasets and we will be focussing on product, household and transaction datasets. The programming language that will be used to analyse the data will be R.


Problem Statement


To analyse the Transaction, Products and Household datasets of Kroger to answer the following questions:

  • How does customer engagement change over time?
  • Which demographic factors affect customer enegagement?
  • How many customers are spending more or less over time?
  • Of those who are spending more or less, which categories are having an impact?

Approach


Before we begin with our analysis, it is always important to have a look at the data and get a feel of it. So, I’ll be starting with looking at the structure of the data after importing the required datasets in R. Looking at the data helps you see what is wrong with the data and the methods that need to be applied to clean the data. Cleaning the data is an essential part of coming up with meaningful insights and can consume up to 60% of the time and effort. Beginning with the analysis without cleaning the data can have negative consequenses which can lead to undesired results and not so meaningful conclusion. Once the data is cleaned, we will move on to the exploratory data analysis and apply different machine learning techniques to generate some meaningful conclusions.

The analysis will try to answer the questions posed in the problem statement so that the stakeholders who are going to use the use this analysis can target the consumers in a right way to increase the sales of the company.


Packages Required

Packages Required


Following are the packages required to analyse the given datasets:

  • readr - It provides a fast and friendly way to read rectangular (csv, tsv) data
  • dplyr - dplyr provides verbs that helps in data manipulation of objects like data frame
  • lubridate - It is a package to to work with date and time
  • stringr - It is used for character manipulation, pattern matching in strings, etc.
  • DT - Helps in displaying data objects as tables on HTML pages
  • tidyverse - tidyverse is a set of packages. It is designed to load multiple packages simulataneously
  • caret - caret package contains functions to streamline the model training process for regression and classification problems
  • plotly - plotly is used for creating interactive webpages in R
  • ranger - ranger package is used for the implementation of random forest
  • zoo - zoo is used for performing calculations for irregualr time series
  • arules - arules is used for frequent items and association rules
  • arulesViz - viz techniques for association and rules and item-sets
  • knitr - knitr is used for dynamic report generation
  • plyr - plyr contains tools for splitting, applying and combining data

Data Preparation

Data source & description

Source

The data represents customer transactions over two years from a group of 2,500 households who are frequent shppers at Kroger. This doesn’t capture the effects of marketing to customers as the focus will only be on partial datsets and not complete datasets leaving out the datasets that contain the “campaign”" data. This data will be used only for academic purpose.


Data Description


Household dataset - It contains demographic information about the selected households

Variables and their description

  • HSHD_NUM - Uniquely identifies each household
  • LOYALTY_FLAG - Indicates the loyalty of the customer with the company
  • AGE_RANGE - Estimated age range
  • MARITAL_STATUS - Marital Status
  • INCOME_RANGE - Household Income
  • HOMEOWNER_DESC - Houseowner, renter
  • HSHD_COMPOSITION - Household composition
  • HH_SIZE - Size of household up to 5+
  • CHILDREN - Number of children present up to 3+

Product dataset - It contains information on each product

Variables and their description

  • PRODUCT_NUM - Uniquely identifies each product
  • DEPARTMENT - Groups similar product together at department level
  • COMMODITY - Groups similar product together at lowest level
  • BRAND_TYPE - Indicates Private or National Brand
  • ORGANIC_FLAG - Indicates whether the product is organic or not

Transaction dataset - It contains information on each transaction

Variables and their description

  • BASKET_NUM - Uniquely identifies a purchase occasion
  • HSHD_NUM - Uniquely identifies each household
  • PURCHASE_DATE - Date when transaction occured
  • PRODUCT_NUM - Uniquely identifies each product
  • SPEND - Amount of dollars retailer received from sale
  • UNITS - Number of items sold
  • STORE_REGION - Region where the store belongs to
  • WEEK_NUM - Week of transaction
  • YEAR - Year of sale

Data cleaning

Data Importing and Cleaning

Importing the dataset using the function read_csv from readr package. It came to the notice that some of the column names were not named properly as per the User Guide that was shared with us. So, rectifying the column names of those columns using the col_names attribute. Using the pmap function to import the datasets simultaneoulsy.

household_colnames <- c("HSHD_NUM","LOYALTY_FLAG","AGE_RANGE","MARITAL_STATUS",
                        "INCOME_RANGE","HOMEOWNER_DESC","HSHD_COMPOSITION",
                        "HH_SIZE","CHILDREN")
products_colnames <- c("PRODUCT_NUM","DEPARTMENT","COMMODITY","BRAND_TYPE",
                        "ORGANIC_FLAG")
transactions_colnames <- c("BASKET_NUM","HSHD_NUM","PURCHASE_DATE","PRODUCT_NUM",
                        "SPEND","UNITS","STORE_REGION",
                        "WEEK_NUM","YEAR")




files <- list("data/5000_households.csv", "data/5000_products.csv", 
              "data/5000_transactions.csv")
column_list <- list(household_colnames, products_colnames, transactions_colnames)
column_type <- list("ccccccccc", "ccccc", "ccccdicii")
skip_rows <- list(1, 1, 1)


df_combined <- pmap(list(file = files, col_names = column_list, col_types = column_type, 
                 skip = skip_rows), read_csv)


households <- data.frame(df_combined[[1]])
products <- data.frame(df_combined[[2]])
transactions <- data.frame(df_combined[[3]])

Understanding the structure of the datsets is very important before we begin cleaning the datasets. We can investigate the datasets by various built in functions that are provided in R. So, getting the feel of the data using the following functions.

  • dim
  • names
  • head
  • tail
  • glimpse
  • summary
  • is.na

Looking at the structure of household, product and transaction datasets.

households[households == ""] <- NA
products[products == ""] <- NA
transactions[transactions == ""] <- NA


map(df_combined,dim)
map(df_combined,names)
map(df_combined,head)
map(df_combined,tail)
map(df_combined,glimpse)
map(df_combined,~any(is.na(.)))
map(df_combined,~colSums(is.na(.)))
map(df_combined,summary)

After importing the datasets, in bulk of the columns we can see that were a lot of missing values represented by null or Unknown. But, as R recognises the missing values only by NA so replacing the missing values by NA.

for (i in 1:ncol(households)) {
  households[,i][households[,i ] == "null" | households[,i ] == "Unknown" | 
                   households[,i ] == "NOT AVAILABLE" ] <- NA
  
}

sum(is.na(households))
sum(complete.cases(households))

There were variables in the datasets that were categorical in nature but were being represented by character datatype so it was necessary to convert the variables in FACTOR datatype. Using the type conversion method, converting the variables in apprpriate format.

Converting the household variables in FACTOR format.

for (i in c(2,4,6,7)) {

  households[,i] <- as.factor(households[,i])
}



unique(households$INCOME_RANGE)
households$INCOME_RANGE <- factor(households$INCOME_RANGE, 
                                       order = TRUE, levels = 
                                        c("UNDER 35K","35-49K","50-74K","75-99K",
                                          "100-150K","150K+"))
households$AGE_RANGE <- str_trim(households$AGE_RANGE)
unique(households$AGE_RANGE)
households$AGE_RANGE <- factor(households$AGE_RANGE, 
                                  order = TRUE, levels = 
                                    c("19-24","25-34","35-44","45-54",
                                      "55-64","65-74","75+"))
households$CHILDREN <- factor(households$CHILDREN, 
                                  order = TRUE, levels = 
                                    c("1","2","3+"))
households$HH_SIZE <- factor(households$HH_SIZE, 
                              order = TRUE, levels = 
                                c("1","2","3","4","5+"))

Converting the product variables in FACTOR format.

for (i in 2:5) {
  
  products[,i] <- as.factor(products[,i])
}

Converting the transaction variables in FACTOR format and date format.

transactions$STORE_REGION <- as.factor(transactions$STORE_REGION)
transactions$PURCHASE_DATE <- dmy(transactions$PURCHASE_DATE)
transactions$BASKET_NUM <- as.numeric(transactions$BASKET_NUM)

Looking at the structure of the data after cleaning the data.

glimpse(households)
## Observations: 5,000
## Variables: 9
## $ HSHD_NUM         <chr> "0688", "2590", "1171", "1531", "0403", "0283...
## $ LOYALTY_FLAG     <fct> Y, N, Y, Y, N, Y, Y, Y, N, Y, Y, Y, Y, Y, Y, ...
## $ AGE_RANGE        <ord> 75+, 75+, 75+, 75+, 75+, 25-34, 25-34, 25-34,...
## $ MARITAL_STATUS   <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ INCOME_RANGE     <ord> 35-49K, 50-74K, 75-99K, 75-99K, UNDER 35K, 50...
## $ HOMEOWNER_DESC   <fct> Homeowner, Homeowner, Homeowner, Homeowner, R...
## $ HSHD_COMPOSITION <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ HH_SIZE          <ord> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ CHILDREN         <ord> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
glimpse(products)
## Observations: 151,141
## Variables: 5
## $ PRODUCT_NUM  <chr> "0092993", "0093924", "0094272", "0094299", "0094...
## $ DEPARTMENT   <fct> NON-FOOD, NON-FOOD, NON-FOOD, NON-FOOD, NON-FOOD,...
## $ COMMODITY    <fct> PET, PET, PET, PET, PET, PET, PET, PET, PET, PET,...
## $ BRAND_TYPE   <fct> PRIVATE, PRIVATE, PRIVATE, PRIVATE, PRIVATE, PRIV...
## $ ORGANIC_FLAG <fct> N, N, N, N, N, N, N, N, N, N, N, N, N, N, N, N, N...
glimpse(transactions)
## Observations: 10,625,553
## Variables: 9
## $ BASKET_NUM    <dbl> 24, 24, 34, 60, 60, 168, 199, 252, 355, 366, 379...
## $ HSHD_NUM      <chr> "1809", "1809", "1253", "1595", "1595", "3393", ...
## $ PURCHASE_DATE <date> 2016-01-03, 2016-01-03, 2016-01-03, 2016-01-03,...
## $ PRODUCT_NUM   <chr> "5817389", "5829886", "0539501", "5260099", "453...
## $ SPEND         <dbl> -1.50, -1.50, 2.19, 0.99, 2.50, 4.50, 3.49, 2.79...
## $ UNITS         <int> -1, -1, 1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1...
## $ STORE_REGION  <fct> SOUTH, SOUTH, EAST, WEST, WEST, SOUTH, SOUTH, SO...
## $ WEEK_NUM      <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ YEAR          <int> 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, ...

Summary

Kroger’s products are approximately equally divided into food and non-food categories with food category containing 74,958 products and non-food category containing 73,134 products. Spend ranges from -14 to 300. There was a huge number of values of spend in negative so assuming that goods have been returned by the customer and there is areverse entry instead of removing it. Units also have a similar case. They range from -26 to 246 so assuming customers have returned the units. Similarly, the maximum week number is 104 which as per the document provided should not be possible as the maximum week mentioned in the document is 102. But, again keeping it as it is as there are a large number of rows containing the week number as 104. Finally, there were a lot of missing values in certain colums so instead of removing it I decided to keep as it is as it can lead to a bias in the final conclusion without understanding the business conetext.

summary(households, maxsum = 15)
summary(products, maxsum = 15)
summary(transactions)

Displaying top 50 observations of household dataset.

households_top50 <- head(households, n = 50)
datatable(
  households_top50, rownames = FALSE,
  extensions = 'Buttons', options = list(
    dom = 'Bfrtip',
    buttons = c('copy', 'csv', 'excel', 'pdf', 'print')
  )
)

Displaying top 50 observations of product dataset.

products_top50 <- head(products, n = 50)
datatable(
  products_top50, rownames = FALSE,
  extensions = 'Buttons', options = list(
    dom = 'Bfrtip',
    buttons = c('copy', 'csv', 'excel', 'pdf', 'print')
  )
)

Displaying top 50 observations of transaction dataset.

transactions_top50 <- head(transactions, n = 50)
datatable(
  transactions_top50, rownames = FALSE,
  extensions = 'Buttons', options = list(
    dom = 'Bfrtip',
    buttons = c('copy', 'csv', 'excel', 'pdf', 'print')
  )
)

Combining all the three datasets with the help of inner join & Displaying top 50 rows of the combinded dataset

merged_data <- transactions %>% 
  inner_join(products, by = "PRODUCT_NUM") %>%
  inner_join(households, by = "HSHD_NUM")

merged_data_top50  <- head(merged_data, n = 50)
datatable(
  merged_data_top50, rownames = FALSE,
  extensions = 'Buttons', options = list(
    dom = 'Bfrtip',
    buttons = c('copy', 'csv', 'excel', 'pdf', 'print')
  )
)

Exploratory Analysis

Page 1

Exploratory Data Analysis

The below graph (Spending vs Units) tells about the spending patterns of households falling under different income brackets. The data is aggreagted for both the years for each household. We can clearly see that households falling under higher income bracket spend more compared to the houses falling in the lower income brackets over the course of two years. Thus, the company can possibly provide more incentives to the households falling in the lower income bracket so that they can increase their spending.

purchase_year <- as.numeric(merged_data$PURCHASE_DATE %>% str_sub(start = 1, end = 4))
merged_data <- merged_data %>% mutate(Year = purchase_year)

spending <- merged_data %>% 
  dplyr::select(HSHD_NUM, SPEND, INCOME_RANGE, UNITS, LOYALTY_FLAG, 
                AGE_RANGE,MARITAL_STATUS, HOMEOWNER_DESC,
                HSHD_COMPOSITION, HH_SIZE, CHILDREN) %>%
  group_by(HSHD_NUM) %>% 
  dplyr::summarise(TOT_SPEND = sum(SPEND), INCOME_RANGE = unique(INCOME_RANGE), 
            TOT_UNITS  = sum(UNITS), LOYALTY_FLAG = unique(LOYALTY_FLAG), 
            AGE_RANGE = unique(AGE_RANGE), MARITAL_STATUS = unique(MARITAL_STATUS), 
            HOMEOWNER_DESC = unique(HOMEOWNER_DESC), 
            HSHD_COMPOSITION = unique(HSHD_COMPOSITION),
            HH_SIZE = unique(HH_SIZE), CHILDREN = unique(CHILDREN))



a <- ggplot(spending, aes(y = TOT_SPEND, x = TOT_UNITS
                        , color = INCOME_RANGE)) +
  geom_point(alpha = 0.5, position = "jitter") +
  geom_smooth(method = "lm", se = FALSE) + 
  labs(x = "Total Units",
              y = "Total Spending of individual households",
              title = "Income wise Household Spending - 2016 & 2017")

ggplotly(a)

Page 2

Exploratory Data Analysis

The below graph shows the monthly spending of the households in the year 2016 and year 2017. We can clearly see from the graph that there is seasonality. Also, in certain months the spending is more in 2016 compared to 2017 and in certain months the spending is less in 2016 compared to 2017. Smooth lines indicate that the spending in 2017 was strong in the initial months but has taperred off in the later months compared to the year 2016.

purchase_yemo <- merged_data$PURCHASE_DATE %>% str_sub(start = 1, end = 7)

merged_data <- merged_data %>% mutate(YeMo = purchase_yemo)
  
spending_timeseries <- merged_data %>% 
  dplyr::select(SPEND, YeMo, Year) %>%
  group_by(YeMo) %>% 
  dplyr::summarise(tot_spend = sum(SPEND), YE = unique(Year))

b <- ggplot(spending_timeseries, aes(y = tot_spend, 
                    x = as.Date(as.yearmon(YeMo)),
                    color = as.factor(YE))) +
  geom_point(alpha = 0.5, position = "jitter") +
  geom_line() +
  geom_smooth(method = "lm", se = FALSE) +
  geom_smooth(method = "lm", se = FALSE, aes(group = 1, col = "All"), linetype = 2) +
  labs(x = "Monthly Index",
       y = "Monthwise Spending",
       title = "Household Spending month by month- 2016 & 2017", 
       color = "Year")

ggplotly(b)

Page 3

Exploratory Data Analysis

The graph below tells us about the monthly spending of organic products in the year 2016 and 2017. We can clearly see that the spending has increased in 2017 for the organic products. The reason might be that people are getting more concious about their health. Another reason might be that the companies have started promoting their organic products more compared to the regular products.

spending_timeseries_org <- merged_data %>% 
  dplyr::select(SPEND, YeMo, YEAR, ORGANIC_FLAG) %>% 
  group_by(YeMo, ORGANIC_FLAG) %>% 
  dplyr::summarise(tot_spend = sum(SPEND), YE = unique(YEAR))

c <-  ggplot(spending_timeseries_org, aes(y = tot_spend, 
                                                x = as.Date(as.yearmon(YeMo)),
                                                color = as.factor(YE))) +
  geom_point(alpha = 0.5, position = "jitter") +
  geom_line() +
  geom_smooth(method = "lm", se = FALSE) +
  facet_wrap(. ~ ORGANIC_FLAG, scales = "free_y") +
  labs(x = "Monthly Index",
       y = "Monthwise Spending",
       title = "Monthly spending for orgainic products and non-organic products", 
       color = "Year")

ggplotly(c)

Page 4

Exploratory Data Analysis

The below graph tells us the region wise spending for 4 different regions. We can see that in the lower spending brackets there are lot of households dominated by the Southern region. The spending is centered close to each other for all the four regions except that the Eastern region’s center is slightly towards right compared to other regions. But,we can see a very thin line in the region of 2500 to 3000 dominated by households by the Western region.

summar <- merged_data %>% 
  dplyr::select(HSHD_NUM, SPEND, Year, STORE_REGION) %>%
  group_by(STORE_REGION, HSHD_NUM, Year) %>% 
  dplyr::summarise(tot_spend = sum(SPEND)) 
  

d <- ggplot(summar,aes(x = tot_spend, fill = STORE_REGION)) + 
  geom_density(col = NA, alpha = 0.35) +
  labs(x = "Household Spending",
       y = "Density",
       title = "Region Wise Spending", 
       color = "Store Region")

ggplotly(d)

Page 5

Exploratory Data Analysis

Finally, this graph tells us about the spending for different commodities for the year 2016 and 2017. We can clearly see that the households spend the most on grocery.

summar2 <- merged_data %>% 
  dplyr::select(COMMODITY, SPEND, Year) %>%
  group_by(COMMODITY, Year) %>% 
  dplyr::summarise(tot_spend = sum(SPEND))

summar2$Year <- as.factor(summar2$Year)

e <- ggplot(summar2, aes(x = tot_spend, y = COMMODITY, col = Year)) +
  geom_point(scale = "free_y", space = "free_y", position = "jitter", alpha = 0.4) +
  labs(x = "Spending",
       y = "Commodity Type",
       title = "Yearly Spending commodity wise", 
       color = "Year") +
     theme(axis.text.y = element_text(size = 8))


ggplotly(e)

Descriptive Techniques

Market Basket Analysis

Market Basket Analysis

tr <- read.transactions("data/transactions.csv",format = "basket", sep = ",")



itemFrequencyPlot(tr,topN = 20,type = "relative",col = brewer.pal(8,'Pastel2'), 
                  main = "Absolute Item Frequency Plot")

association_rules <- apriori(tr, parameter = list(supp = 0.15, conf = 0.8, maxlen = 10,
                                                  target = "rules"))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.8    0.1    1 none FALSE            TRUE       5    0.15      1
##  maxlen target   ext
##      10  rules FALSE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 150607 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[43 item(s), 1004051 transaction(s)] done [0.61s].
## sorting and recoding items ... [9 item(s)] done [0.06s].
## creating transaction tree ... done [1.01s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [11 rule(s)] done [0.00s].
## creating S4 object  ... done [0.10s].
summary(association_rules)
## set of 11 rules
## 
## rule length distribution (lhs + rhs):sizes
## 2 3 
## 5 6 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   2.000   3.000   2.545   3.000   3.000 
## 
## summary of quality measures:
##     support         confidence          lift           count       
##  Min.   :0.1563   Min.   :0.8027   Min.   :1.232   Min.   :156960  
##  1st Qu.:0.1771   1st Qu.:0.8355   1st Qu.:1.282   1st Qu.:177832  
##  Median :0.1949   Median :0.9183   Median :1.409   Median :195669  
##  Mean   :0.2258   Mean   :0.8802   Mean   :1.351   Mean   :226701  
##  3rd Qu.:0.2531   3rd Qu.:0.9238   3rd Qu.:1.418   3rd Qu.:254106  
##  Max.   :0.3428   Max.   :0.9264   Max.   :1.422   Max.   :344195  
## 
## mining info:
##  data ntransactions support confidence
##    tr       1004051    0.15        0.8
inspect(association_rules[1:10])
##      lhs                             rhs              support   confidence
## [1]  {HOUSEHOLD}                  => {GROCERY STAPLE} 0.1948795 0.8027117 
## [2]  {FROZEN FOOD}                => {GROCERY STAPLE} 0.2418503 0.8163423 
## [3]  {BAKERY}                     => {GROCERY STAPLE} 0.2643113 0.8387680 
## [4]  {DAIRY}                      => {GROCERY STAPLE} 0.3394828 0.8544413 
## [5]  {PRODUCE}                    => {GROCERY STAPLE} 0.3428063 0.8322996 
## [6]  {BEVERAGE - NON WATER,DAIRY} => {GROCERY STAPLE} 0.1563267 0.9263511 
## [7]  {DAIRY,FROZEN FOOD}          => {GROCERY STAPLE} 0.1713967 0.9264407 
## [8]  {FROZEN FOOD,PRODUCE}        => {GROCERY STAPLE} 0.1649468 0.9232635 
## [9]  {BAKERY,DAIRY}               => {GROCERY STAPLE} 0.1869646 0.9242654 
## [10] {BAKERY,PRODUCE}             => {GROCERY STAPLE} 0.1828323 0.9182553 
##      lift     count 
## [1]  1.232053 195669
## [2]  1.252974 242830
## [3]  1.287394 265382
## [4]  1.311451 340858
## [5]  1.277466 344195
## [6]  1.421823 156960
## [7]  1.421960 172091
## [8]  1.417084 165615
## [9]  1.418621 187722
## [10] 1.409397 183573
subRules <- association_rules[quality(association_rules)$confidence > 0.8]

top10subRules <- head(association_rules, n = 10, by = "confidence")



plot(top10subRules, method = "graph",  engine = "htmlwidget")

RFM Analysis

Recency, Frequency and Monetary Value grouping based on K-Means Clustering

purchase_year <- as.numeric(merged_data$PURCHASE_DATE %>% str_sub(start = 1, end = 4))
merged_data <- merged_data %>% mutate(Year = purchase_year)


recent_date <- max(merged_data$PURCHASE_DATE) + 1



df_RFM <- merged_data %>% 
  filter(Year == 2017) %>%
  group_by(HSHD_NUM) %>% 
  dplyr::summarise(Recency = as.numeric(recent_date - max(PURCHASE_DATE)),
            Frequency = length(unique(BASKET_NUM)), 
            MonetaryValue = sum(SPEND)) 
  
df_RFM[,-1] %>% 
  gather() %>% 
  ggplot(aes(value)) +
  facet_wrap(~ key, scales = "free") +
  geom_density() + 
  theme_minimal() 

df_RFM$ScaledRecency <- scale(log(df_RFM$Recency))
df_RFM$ScaledFrequency <- scale(log(df_RFM$Frequency))
df_RFM$ScaledMonetaryValue <- scale(log(df_RFM$MonetaryValue))





wss <- vector("numeric",15)
for (i in 1:15) {
  km_out <- kmeans(df_RFM[,5:7], centers = i, nstart = 20, iter.max = 50)
  wss[i] <- km_out$withinss
}

plot(1:15,wss,type = "b", xlab = 'Number of Clusters', 
     ylab = "Within group of sum of squares")

km_out1 <- kmeans(df_RFM[,5:7], centers = 3, nstart = 20, iter.max = 50)




clustered_df <- data.frame(HSHD_NUM = df_RFM$HSHD_NUM, Recency = df_RFM$Recency, 
                           Frequency = df_RFM$Frequency, 
                           MonetaryValue = df_RFM$MonetaryValue, Cluster = km_out1$cluster)

summary_table <- clustered_df %>% 
  dplyr::select(Cluster, Recency,Frequency,MonetaryValue) %>%
  group_by(Cluster) %>% 
  dplyr::summarise(Recency = mean(Recency), Frequecy = mean(Frequency), 
                   Monetaryvalue = mean(MonetaryValue), 
                   Number_Household = length(Cluster))
  
gathered_table <- gather(summary_table[1:4], RFM, Mean_value, -Cluster)

ggplot(gathered_table, aes(y = Mean_value, x = RFM, color = as.factor(Cluster))) +
  geom_line(aes(group = Cluster)) + 
  scale_y_log10() + 
  labs(x = "RFM",
       y = "Mean RFM Value",
       title = "K-Means Clutering based on RFM")


Stepwise Selection

Stepwise Selection of predictors impacting total houehold spending

The technique that I have used is Stepwise Selection to come up with the best predictors to apply Linear Regression. The purpose of using linear regression is that here we are trying to find the relationship between the spending variable (response) and different predictors. The predictors that were shortlisted based on STepwise Selection were income range, unit, household description and marital status.

linear_data <- merged_data %>% 
  dplyr::select(HSHD_NUM,AGE_RANGE,MARITAL_STATUS,
                               INCOME_RANGE,HSHD_COMPOSITION,HH_SIZE,
                               CHILDREN, SPEND, LOYALTY_FLAG, UNITS, 
                               STORE_REGION, HOMEOWNER_DESC) %>% 
  group_by(HSHD_NUM) %>% 
  dplyr::summarise(SPEND = sum(SPEND),
            INCOME_RANGE = unique(INCOME_RANGE),LOYALTY_FLAG = unique(LOYALTY_FLAG),
            AGE_RANGE = unique(AGE_RANGE),
            CHILDREN = unique(CHILDREN), HOMEOWNER_DESC = unique(HOMEOWNER_DESC),
            MARITAL_STATUS = unique(MARITAL_STATUS), 
            HSHD_COMPOSITION = unique(HSHD_COMPOSITION), HH_SIZE = unique(HH_SIZE),
            UNITS = sum(UNITS))

linear_data <- linear_data[,-1]

map_dbl(linear_data, function(x){sum(is.na(x))}) %>% 
  sort(decreasing = TRUE) %>% .[. > 0]
##         CHILDREN   HOMEOWNER_DESC   MARITAL_STATUS        AGE_RANGE 
##             3075             1246             1207              997 
## HSHD_COMPOSITION          HH_SIZE     INCOME_RANGE 
##              934              933              865
linear_data <- linear_data[,-5]

entire_model <- lm(SPEND~., data = na.omit(linear_data))
null_model <- lm(SPEND~1, data = na.omit(linear_data))



stepwise_selection_full <- step(null_model, scope = list(lower = null_model
                                                         ,upper = entire_model),
                                direction = "both")
## Start:  AIC=59409.38
## SPEND ~ 1
## 
##                    Df  Sum of Sq        RSS   AIC
## + UNITS             1 1.3818e+11 1.4404e+10 51457
## + LOYALTY_FLAG      1 4.0783e+09 1.4850e+11 59320
## + HH_SIZE           4 1.5267e+09 1.5106e+11 59383
## + AGE_RANGE         6 1.1104e+09 1.5147e+11 59397
## + HSHD_COMPOSITION  5 6.0423e+08 1.5198e+11 59406
## <none>                           1.5258e+11 59409
## + MARITAL_STATUS    1 2.3418e+07 1.5256e+11 59411
## + INCOME_RANGE      5 3.8384e+08 1.5220e+11 59411
## + HOMEOWNER_DESC    1 1.6016e+05 1.5258e+11 59411
## 
## Step:  AIC=51457.43
## SPEND ~ UNITS
## 
##                    Df  Sum of Sq        RSS   AIC
## + INCOME_RANGE      5 5.7999e+08 1.3824e+10 51329
## + HOMEOWNER_DESC    1 1.0488e+08 1.4299e+10 51435
## + HSHD_COMPOSITION  5 9.3793e+07 1.4310e+10 51445
## + MARITAL_STATUS    1 5.0398e+07 1.4353e+10 51448
## + HH_SIZE           4 4.3577e+07 1.4360e+10 51455
## + AGE_RANGE         6 5.2266e+07 1.4351e+10 51457
## + LOYALTY_FLAG      1 9.1878e+06 1.4394e+10 51457
## <none>                           1.4404e+10 51457
## - UNITS             1 1.3818e+11 1.5258e+11 59409
## 
## Step:  AIC=51328.92
## SPEND ~ UNITS + INCOME_RANGE
## 
##                    Df  Sum of Sq        RSS   AIC
## + HOMEOWNER_DESC    1 6.1786e+07 1.3762e+10 51316
## + MARITAL_STATUS    1 2.3277e+07 1.3800e+10 51325
## + AGE_RANGE         6 5.3600e+07 1.3770e+10 51328
## + HSHD_COMPOSITION  5 4.1057e+07 1.3783e+10 51329
## <none>                           1.3824e+10 51329
## + LOYALTY_FLAG      1 2.8267e+06 1.3821e+10 51330
## + HH_SIZE           4 1.4566e+07 1.3809e+10 51333
## - INCOME_RANGE      5 5.7999e+08 1.4404e+10 51457
## - UNITS             1 1.3838e+11 1.5220e+11 59411
## 
## Step:  AIC=51315.83
## SPEND ~ UNITS + INCOME_RANGE + HOMEOWNER_DESC
## 
##                    Df  Sum of Sq        RSS   AIC
## + MARITAL_STATUS    1 8.7127e+06 1.3753e+10 51316
## <none>                           1.3762e+10 51316
## + LOYALTY_FLAG      1 2.6717e+06 1.3759e+10 51317
## + HSHD_COMPOSITION  5 3.4185e+07 1.3728e+10 51317
## + AGE_RANGE         6 3.6780e+07 1.3725e+10 51319
## + HH_SIZE           4 8.3707e+06 1.3754e+10 51322
## - HOMEOWNER_DESC    1 6.1786e+07 1.3824e+10 51329
## - INCOME_RANGE      5 5.3690e+08 1.4299e+10 51435
## - UNITS             1 1.3844e+11 1.5220e+11 59413
## 
## Step:  AIC=51315.69
## SPEND ~ UNITS + INCOME_RANGE + HOMEOWNER_DESC + MARITAL_STATUS
## 
##                    Df  Sum of Sq        RSS   AIC
## <none>                           1.3753e+10 51316
## - MARITAL_STATUS    1 8.7127e+06 1.3762e+10 51316
## + LOYALTY_FLAG      1 2.3356e+06 1.3751e+10 51317
## + HSHD_COMPOSITION  4 2.5472e+07 1.3728e+10 51317
## + AGE_RANGE         6 3.2884e+07 1.3720e+10 51320
## + HH_SIZE           4 1.1109e+07 1.3742e+10 51321
## - HOMEOWNER_DESC    1 4.7222e+07 1.3800e+10 51325
## - INCOME_RANGE      5 5.2434e+08 1.4278e+10 51432
## - UNITS             1 1.3843e+11 1.5218e+11 59414
best_model <- lm(SPEND~ INCOME_RANGE + UNITS + HOMEOWNER_DESC + MARITAL_STATUS, 
                 data = na.omit(linear_data))

summary(best_model)
## 
## Call:
## lm(formula = SPEND ~ INCOME_RANGE + UNITS + HOMEOWNER_DESC + 
##     MARITAL_STATUS, data = na.omit(linear_data))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18874.8   -854.3   -120.6    744.8  30992.5 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           322.43711   59.66965   5.404 6.98e-08 ***
## INCOME_RANGE.L       1030.94442   96.64390  10.667  < 2e-16 ***
## INCOME_RANGE.Q        -22.41218   94.34536  -0.238 0.812241    
## INCOME_RANGE.C        -85.39749   89.22845  -0.957 0.338603    
## INCOME_RANGE^4         76.83613   85.81353   0.895 0.370646    
## INCOME_RANGE^5        -32.53545   84.93640  -0.383 0.701702    
## UNITS                   2.72862    0.01484 183.925  < 2e-16 ***
## HOMEOWNER_DESCRenter -370.50425  109.06625  -3.397 0.000689 ***
## MARITAL_STATUSSingle -112.60097   77.16717  -1.459 0.144608    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2023 on 3361 degrees of freedom
## Multiple R-squared:  0.9099, Adjusted R-squared:  0.9096 
## F-statistic:  4241 on 8 and 3361 DF,  p-value: < 2.2e-16

Cohort Analysis

Cohort Analysis

cohort_data <- merged_data  %>% 
  dplyr::select(HSHD_NUM,PURCHASE_DATE,Year)

cohort_data$PURCHASE_DATE <- ymd(format(cohort_data$PURCHASE_DATE, "%Y-%m-01"))

Join_Date <- cohort_data %>%
     group_by(HSHD_NUM) %>% 
    dplyr::summarise(Joindate = min(PURCHASE_DATE))
    



cohort_data <- left_join(cohort_data,Join_Date, by = "HSHD_NUM")

years_diff <- as.numeric(format(cohort_data$PURCHASE_DATE,'%Y')) - 
                 as.numeric(format(cohort_data$Joindate,'%Y'))

months_diff <- as.numeric(format(cohort_data$PURCHASE_DATE,'%m')) - 
                    as.numeric(format(cohort_data$Joindate,'%m'))

cohort_data$monthindex <- years_diff * 12 + months_diff + 1

cohort_data$JoinMonth <- format(cohort_data$Joindate,"%Y-%m")


cohort_final <- cohort_data %>% group_by(JoinMonth, monthindex) %>%
  dplyr::summarise(hshd = length(unique(HSHD_NUM)))

cohort_spread <- spread(cohort_final,monthindex,hshd)

for (i in 3:25) {
  
cohort_spread[,i] <- round(((cohort_spread[,i] / cohort_spread[,2]) * 100), 1)

}

cohort_spread[,2] <- round(((cohort_spread[,2] / cohort_spread[,2]) * 100),1)

cohort_spread <- cohort_spread[2:23,]

breaks <- quantile(data.frame(cohort_spread[,2:24]), probs = seq(.05, .95, .05), 
                   na.rm = TRUE)
colors <- sapply(round(seq(155, 80, length.out = length(breaks) + 1), 0),
                  function(x){ rgb(x,x,155, maxColorValue = 155) } )


# The Retention Mixpanel with counts
DT::datatable(cohort_spread,
              class = 'cell-border stripe',
              rownames = FALSE,
              options = list(
                ordering=F,
                dom = 't',
                pageLength = 23) ) %>%
  formatStyle("JoinMonth",
              backgroundColor = 'lightgrey',
              fontWeight = 'bold') %>%
  formatStyle(names(cohort_spread[-1]),fontWeight = 'bold',
              color = 'white', backgroundColor = styleInterval(breaks,colors))